Abstract

Background: F2 resource populations have been used extensively to map QTL segregating between pig breeds. A 
limitation associated with the use of these resource populations for fine mapping of QTL is the reduced number of 
founding individuals and recombinations of founding haplotypes occurring in the population. These limitations, 
however, become advantageous when attempting to impute unobserved genotypes using within family 
segregation information. A trade-off would be to re-type F2 populations using high density SNP panels for founding 
individuals and low density panels (tagSNP) in F2 individuals followed by imputation. Subsequently a combined 
meta-analysis of several populations would provide adequate power and resolution for QTL mapping, and could be 
achieved at relatively low cost. Such a strategy allows the wealth of phenotypic information that has previously 
been obtained on experimental resource populations to be further mined for QTL identification. In this study we 
used experimental and simulated high density genotypes (HD-60K) from an F2 cross to estimate imputation 
accuracy under several genotyping scenarios. 

Results: Selection of tagSNP using physical distance or linkage disequilibrium information produced similar 
imputation accuracies. In particular, tagSNP sets averaging 1 SNP every 2.1 Mb (1,200 SNP genome-wide) yielded 
imputation accuracies (IA) close to 0.97. If, instead of using custom panels, the commercially available 9K chip is 
used in the F2, IA reaches 0.99. In order to attain such high imputation accuracy, the F0 and F1 generations should 
be genotyped at high density. Alternatively, when only the F0 is genotyped at HD, while F1 and F2 are genotyped 
with a 9K panel, IA drops to 0.90. 

Conclusions: Combining 60K and 9K panels with imputation in F2 populations is an appealing strategy to 
re-genotype existing populations at a fraction of the cost. 



Background 

The search for regions in the genome containing genetic 
variants that affect production traits requires experimental 
populations to identify the segregating QTL within and 
between parental populations [1]. The F2 design is com- 
monly used to map QTL segregating in divergent parental 
lines [2,3]. To produce reliable analyses of association or 
genetic evaluations using genomic information, a great 
number of individuals with phenotypes and high density 
(HD) genotypes are required [4]. However, HD genotypes 
for large numbers of animals are expensive to obtain [5,6]. 
A way of reducing cost is to genotype individuals from 
base generations (parents) in HD, and their more numer- 
ous descendants at low density (LowD) [6,7]. Then, using 
selected SNP from the HD panel, called tagSNP, the non- 
typed SNP are imputed with high accuracy [7]. Imputing 
HD genotypes of progeny from LowD genotypes, condi- 
tional on grandparental and parental HD genotypes, may 
result in higher imputation accuracies than those obtained 
using a reference panel from unrelated individuals [7-9]. 
This is because HD genotypes from base generations can 
be traced within family by means of co-segregation or 
descendant probabilities [6] while searching for the phase 
of parental alleles [7]. 

Most studies on genotype imputation of livestock spe- 
cies have been performed with purebreds [4,7,9-13], and 
genotype imputation from crossbreds has been largely ab- 
sent. With regard to agricultural plant species, studies on 
genotype imputation have used inbred lines [8], recombin- 
ant inbred lines (RILs) in Nested Association Mapping 
(NAM) designs [14,15], and Multiparent Advanced Generation 
Inter-Cross (MAGIC) studies [16]. Genotype imputation has also 
been employed in human studies of genome-linkage analysis to 
test the association of candidate transcriptional regulators with 
gene expression [17]. Likewise, in the mouse, a model organism 
in biomedical research, imputation of genotypes from crosses of 
inbred lines was used to identify candidate genes for complex 
disease [18,19]. 

Imputing genotypes in humans, plants, livestock, or 
model organisms, is similar in the sense that a small num- 
ber of founding individuals can be genotyped at high 
density, and the bulk of the mapping population can be 
genotyped at low density using linkage information. In this 
paper we focus on imputing F2 individuals from a three 
generation (F0, F1 and F2) population of Duroc x Pietrain 
crossbred pigs. The F0 and F1 animals were genotyped in 
HD (60K). The F0 populations used to map QTL in pigs 
are typically composed of a small number of animals (in 
our case, 4 males and 15 females) [1,20-22]. As it is 
expected that few recombinations occur in the first gener- 
ations, these populations have low resolution to map QTL 
[23]. However, and for the same reason, there is a poten- 
tial for attaining high accuracy of imputation. The latter 
effect can be exploited to impute HD genotypes from 
inexpensive LowD F2 genotypes, which subsequently 
allows combining existing data from experimental 
populations in a meta-analysis for association. There are 
several reasons for this strategy to be attractive. First, sev- 
eral of these populations have been recently created 
[21,22,24,25] and DNA from these animals is available. 
Second, extensive datasets of phenotypes have been re- 
corded for these populations including for traits that are 
expensive or difficult to measure, such as the content of 
intramuscular fat and composition of fatty acids [25], age 
at puberty in gilts [22], and meat tenderness [26]. Finally, 
these populations are generally developed from breeds 
that are divergent for some traits of interest such as fat/lean 
content, meat quality or reproductive efficiency. Examples 
include Duroc x Pietrain [1,21], Duroc x Landrace 
[24], Duroc x Large White [25], White-Duroc x Erhualian 
[22], Meishan x Duroc [27], and Berkshire x Duroc [20]. 

Therefore, it follows that imputation of F2 LowD to HD 
genotypes with high accuracy would be useful and 
convenient, providing a cost effective strategy as a first 
step for association analyses or meta-analyses. Different 
methods have been employed to select tagSNP in LowD 
panels. Two of them are: 1) imposing restrictions on the 
minimum value of linkage disequilibrium (LD), or r², 
between markers [28]; and 2) selection of tagSNP that are evenly 
spaced using the physical distance between markers 
[4,11,12]. In addition, commercial chips are also available 
with medium density segregating SNP selected from sev- 
eral populations, as for example for bovine [29] and pig 
[10]. One question is how many SNP are needed to attain 
a high accuracy of imputation for a given F2 population. 
Another question is whether a specific chip has to be 
custom designed, or whether current commercially available 
chips can be used. Finally, it is important to determine 
whether both the F0 and F1 have to be genotyped at 
HD, or if just genotyping the F0 is adequate to obtain a 
high accuracy of imputation in the F2. 

The goal of this research was to estimate the accuracy of 
imputation at HD (60K), from LowD F2 genotypes for 
a Duroc x Pietrain population, using different genoty- 
ping schemes. The strategies were evaluated by means of 
Monte Carlo simulation, conditional on the genotypes 
from animals in the first two generations (F0 and F1). In 
doing so, two methods of tagSNP selection were consid- 
ered and their results were compared to those obtained 
from a commercial panel chip (9K). In addition to simula- 
tions, accuracy of imputation was evaluated using experi- 
mental data, taking advantage of a reduced number of F2 
animals that were genotyped at HD. 

Results 

Linkage disequilibrium and selection of tagSNP 

Table 1 displays the number of tagSNP selected with dif- 
ferent values of LD in an intermediate size chromosome 
(SSC12), reflected by the measure r². As the value of r² 
increases, more tagSNP are selected and IA increases. As an 
example, when r² = 0.2, 79 tagSNP were selected at an 
average distance of 0.79 Mb and at an accuracy of 0.970. 
On the other hand, for r² = 0.5, 399 tagSNP were selected, 
positioned at an average distance of 0.16 Mb, with IA being 
equal to 0.982 (Table 1). 
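
The LD-based selection can be illustrated with a minimal sketch. It assumes a greedy tagging rule (repeatedly pick the SNP that covers the largest number of still-uncovered SNP at or above the r² threshold), which is a common approach but is not necessarily the algorithm used in this study. The function name `select_tagsnp_ld`, the 0/1/2 dosage coding, and the toy genotype matrix are illustrative assumptions.

```python
import numpy as np

def select_tagsnp_ld(genotypes, r2_threshold):
    """Greedy sketch: every SNP ends up a tag or in r^2 >= threshold with a tag.

    genotypes: (n_individuals, n_snps) array of allele dosages (0/1/2).
    This greedy rule is an assumption, not the study's documented method.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    # Pairwise squared correlation between SNP columns.
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2
    r2 = np.nan_to_num(r2)          # monomorphic SNP give NaN correlations
    np.fill_diagonal(r2, 1.0)       # each SNP always covers itself
    covers = r2 >= r2_threshold
    uncovered = np.ones(genotypes.shape[1], dtype=bool)
    tags = []
    while uncovered.any():
        # Number of still-uncovered SNP each candidate would cover.
        counts = (covers & uncovered).sum(axis=1)
        best = int(np.argmax(counts))
        tags.append(best)
        uncovered &= ~covers[best]
    return sorted(tags)

# Toy data: SNP 0 and 1 are identical, SNP 2 is only weakly correlated with them.
toy = np.array([[0, 0, 2], [1, 1, 0], [2, 2, 1], [0, 0, 1], [2, 2, 0], [1, 1, 2]])
tags_low = select_tagsnp_ld(toy, 0.2)
tags_high = select_tagsnp_ld(toy, 0.5)
```

In this toy example a higher threshold selects more tagSNP, which mirrors the trend reported in Table 1.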

Evenly spaced SNP 

The IA obtained using tagSNP selected with either LD information 
or even spacing were similar. For example, the IA of 
non-typed SNP on SSC12 were 0.973 and 0.970 for 80 evenly 
spaced SNP and for 79 tagSNP selected with r² = 0.2, 
respectively (Figure 1). Results for other 
densities of tagSNP were similar (Figure 1). Moreover, 
evenly spaced tagSNP sets of comparable density across 
chromosomes yielded similar accuracies. For example, with 
an average inter-marker distance of 2.1 Mb, 140 
tagSNP on chromosome 1 and 30 tagSNP on chromosome 
12 produced IA of 0.969 and 0.968, respectively 
(Figure 2). In summary, a minimum of 1,200 evenly 
Table 1 Accuracy of imputation using tagSNP selected for different values of the r² threshold on chromosome 12 

r² threshold   Number of tagSNP   Number of SNP genome-wide   Average distance between SNP (Mb)   Imputation accuracy (IA)
0.1            33                 1295                        1.86                                0.960
0.2            79                 3100                        0.79                                0.970
0.3            158                6199                        0.40                                0.976
0.4            266                10436                       0.24                                0.980
0.5            399                15654                       0.16                                0.982

r² threshold = Threshold of LD in statistical selection. Number of tagSNP = Number of SNP selected for a particular r² threshold. Number of SNP genome-wide = The equivalent number 
of SNP that are needed genome-wide to keep the same average inter-marker distance. Average distance between SNP (Mb) = Average distance between tagSNP 
selected. Imputation accuracy (IA) = Imputation accuracy of non-tagSNP. Results using simulated data. 
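
As a worked check of the "Number of SNP genome-wide" column, the sketch below computes the ratio between the genome-wide equivalent and the number of tagSNP on SSC12 for each row of Table 1. The values are taken from the table. The interpretation of the nearly constant ratio as the length of the genome relative to SSC12 is an assumption, and the names `TABLE1` and `implied_scale_factors` are illustrative.

```python
# Table 1 rows: r2 threshold -> (tagSNP on SSC12, reported genome-wide equivalent)
TABLE1 = {
    0.1: (33, 1295),
    0.2: (79, 3100),
    0.3: (158, 6199),
    0.4: (266, 10436),
    0.5: (399, 15654),
}

def implied_scale_factors(table):
    """Ratio of genome-wide SNP to chromosome-level tagSNP for each threshold."""
    return {t: genome / chrom for t, (chrom, genome) in table.items()}

def genome_wide_equivalent(n_tag_on_chromosome, scale):
    """Scale a chromosome-level tagSNP count to the whole genome."""
    return n_tag_on_chromosome * scale

scales = implied_scale_factors(TABLE1)
mean_scale = sum(scales.values()) / len(scales)
for threshold, scale in scales.items():
    print(f"r2 threshold {threshold}: scale factor {scale:.2f}")
```

All five rows give a ratio close to 39.2, so the genome-wide column appears to be the SSC12 count multiplied by a single constant.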



spaced tagSNP across the genome (average distance = 2.1 
Mb) are needed in this F2 population to attain imputation 
accuracy IA > 0.97 when the F0 and F1 are genotyped with 
a SNP60 chip. 

Imputed genotypes in experimental F2 animals 
9K commercial chip 

The values of lA were calculated for two scenarios and for 
each chromosome, using a 9K SNP list that was developed 
for producing a commercial LowD panel (GeneSeek, Inc., 
Lincoln, NE, USA; described in Badke et al. [10]). 

Imputation accuracies IA were 0.90 and 0.99 when the 
F1 was genotyped at low or high density, respectively 
(Figure 3). In the latter case, although the accuracy was 
high in all chromosomes (0.99), SNP in some regions were 
imputed with lower accuracy (Figure 4). High IA in the F2 
were obtained across all SNP when the F1 was genotyped 
at HD (Figure 4a,b). However, when the F1 was genotyped 
at LowD, IA in F2 individuals decreased along the whole 
chromosome (Figure 4c,d). A logical question to consider 
is the following: how much accuracy is gained by including 
pedigree information, compared with using 
population-wide LD as the only source of information? 
To answer this, the imputation was performed 
again using as reference panel the genotypes of F0 and F1 
animals and the F2 at LowD, but without specifying the 
pedigree of the F2s. In other words, the F2 animals were 
assumed unrelated and their parents were unknown. For 
chromosome 1 the results are displayed in Figure 5. Notice 
that the average IA in the F2 was equal to 0.90. Therefore, 
the IA was lower than when the information on 
relationships was used (0.99, Figure 4a,b). This indicates 
that the inclusion of HD genotypes from related animals 
Figure 1 Accuracy of imputation (IA) using tagSNP selected using LD information or evenly spaced. Imputation accuracy as a function of 
number of tagSNP selected: evenly spaced tagSNP panels (green dots), tagSNP-LD panels (red squares). Results shown correspond to simulated 
data from chromosome 12. 

Figure 2 Accuracy of imputation (IA) as a function of average tagSNP distance for evenly spaced panels on chromosomes 1 and 12. 
Imputation accuracy as a function of the average distance between evenly spaced tagSNP in megabases (Mb) for seven SNP panels on simulated 
data from chromosome 1 (red squares) and chromosome 12 (blue dots). 



Figure 3 Accuracy of imputation (IA) for SNP on 60K chip using the 9K panel as tagSNP. Average accuracy of imputation for each 
chromosome using experimental data: Blue bars correspond to the case of F0 at high density (HD), F1 and F2 at low density (LowD). Red bars 
correspond to gain in accuracy of imputation when the F1 is genotyped at HD. 

Figure 4 Accuracy of imputation (IA) across chromosomes 1 and 12 under two genotyping scenarios. Generation F0 and F1 at high 
density, F2 at low density: chromosome 1 (a), chromosome 12 (b). Generation F0 at high density, F1 and F2 at low density: chromosome 1 (c), 
chromosome 12 (d). The blue line displays a local regression fit of the data. All results were obtained using the experimental data. 



and explicitly specifying paternities greatly increases ac- 
curacy of imputation. 

The IA from both genotyping scenarios (Figures 3 and 
4) reflect an average drop of 0.1 when the F1 is genotyped 
at LowD. To gain further insight, the simulated haplotypes 
of two families were used to calculate accuracy of 
imputation in each scenario. When the F1 is genotyped 
at LowD, the results showed that the phase error among 
the SNP that are not tagSNP increased. This loss of accuracy 
in determining the SNP phase can be traced back 
to the F0 generation in which the non-tagSNP are also 
phased with low accuracy. Furthermore, the proportion 
of SNP with uncertain phase in the F1 genotyped at HD 
was 4%, and the ensuing accuracy of haplotyping was 
0.97. However, when the F1 was genotyped at LowD the 
proportion of SNP with uncertain phase increased to 
30%, and the corresponding accuracy of haplotyping for 
the non-tagSNP of F1 genotypes dropped to 0.85. In a 
further analysis with the F1 generation genotyped at HD 
and used as a reference population (ignoring F0 genotypes), 
43% of non-tagSNP had uncertain phase 
in the F1 at HD, and the haplotyping accuracy 
was even lower (0.78). These results suggest that, in 
order to have a high accuracy of imputation for non-tagSNP 
in F2 genotypes, certainty of the phase in the F1 
genotypes is required. Such accurately estimated phase is 
guaranteed when two generations of HD genotypes (F0, F1) 
are available. 

A closer look at Figures 4 and 5 indicates that the position 
of the SNP had some effect on IA. Therefore, we 
investigated the relationship between single SNP imputation 
accuracy and each SNP's MAF, distance to the 
nearest tagSNP, and allelic frequency difference between 
founding breeds. 
Figure 5 Accuracy of imputation (IA) per SNP ignoring 
pedigree relationship on chromosome 1. Imputation accuracy of 
experimental data as a function of the chromosomal positions of 
SNP using information on LD only. Generation F0 and F1 genotyped 
at high density and F2 genotyped at low density with the 
relationships between F0, F1 and F2 omitted. 



Minor allele frequency (MAF) 

The measure of accuracy based on counting the number 
of alleles correctly imputed is sensitive to the allelic fre- 
quency [8,12,30]. In the current study, the square of the 
correlation (R²) between observed and imputed genotypes 
was used as a robust measure of accuracy of imputation. 
It is worth noting that the scale of this measure is somewhat 
different from the one derived from IA (Table 2). 
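
The difference in scale between the two measures can be illustrated with a small sketch. It assumes that IA is the proportion of alleles correctly imputed, with genotypes coded as allele dosages 0/1/2, and that R² is the squared Pearson correlation between observed and imputed dosages. The function names and the toy data for a low-MAF SNP are illustrative assumptions, not the study's code.

```python
import numpy as np

def allelic_accuracy(observed, imputed):
    """Proportion of alleles correctly imputed (assumed definition of IA)."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    # A genotype differing by one dosage unit has one wrong allele out of two.
    correct_alleles = 2.0 - np.abs(observed - imputed)
    return float(correct_alleles.mean() / 2.0)

def r_squared(observed, imputed):
    """Squared Pearson correlation between observed and imputed dosages."""
    r = np.corrcoef(observed, imputed)[0, 1]
    return float(r ** 2)

# Toy low-MAF SNP: 20 animals, two carriers, two imputation errors.
observed = [0] * 18 + [1, 1]
imputed = [0] * 17 + [1, 0, 1]

ia = allelic_accuracy(observed, imputed)
r2 = r_squared(observed, imputed)
print(f"IA = {ia:.3f}, R2 = {r2:.3f}")
```

With these toy data the allele-counting measure stays at 0.95 while R² falls to about 0.20, which shows why counting alleles can be misleading for SNP with extreme allele frequencies.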

MAF using the 9K panel in the F2 

Figure 6 shows that the MAF of the imputed SNP was not 
related to R² in these data. Notice also that alleles with 
extreme frequencies (MAF < 0.1) can be imputed with 
accuracy similar to those SNP at intermediate frequencies 
(MAF > 0.3). 

Distance to the closest tagSNP 

No differences in R² were found for the range of distances 
between non-tagSNP and tagSNP observed (average was 
equal to 0.936 Mb). Therefore, for an average distance 
between tagSNP of 0.26 Mb, R² is similar for a SNP that is 
in the middle of the interval and for a SNP that is close 
to the tagSNP (Figure 7). This observation suggests that 
the density of tagSNP was enough to attain a reasonably 
equal R² for all SNP within the interval. 

Effect of the difference in allelic frequencies in the F0 

The difference in allelic frequency between founding populations 
does not seem to affect the R². This means that 
even SNP that segregate at very different frequencies in 
founders can be imputed with high accuracy, as revealed in 
Figure 8. Moreover, the apparent drop in R² for MAF 
differences over 0.75 presented in Figure 8 is largely an 
artefact of the very small number of SNP used in the smoothing 
line fit. 

Discussion 

SNP selection methods and accuracy of imputation 

A main goal of the present research was to evaluate accur- 
acy of imputation in an F2 cross of pigs (Duroc x Pietrain) 
using different genotyping scenarios. In a first stage, IA was 
calculated from simulated F2 data. An ideal situation for 
linkage based imputation would be to select SNP equally 
spaced based on genetic distance, as the possibility of 
recombination between imputed SNP and tagSNP would be 
minimal. However, this is not possible in the absence of a 
high resolution linkage map. Consequently, to position the 
tagSNP we used two proxies: a) physical spacing, and b) 
LD-based selection. For our simulated population, the two 
proxies produced the same results, most likely because a 
uniform recombination rate of 1 cM = 1 Mb was assumed. 
Therefore, in this simulated population, the average 
distance between tagSNP throughout the genome proved 
to be a good indicator of accuracy of imputation (IA), as 
values greater than or equal to 0.97 were obtained using average 
distances among tagSNP that were less than or equal to 2.1 
Mb. Next, the selection of tagSNP using the LD method 
was compared to choosing SNP located at regularly spaced 
intervals throughout the genome. In the first method, LD 
was measured by r², and a minimum threshold of r² was required 
between any non-tagSNP and at least one tagSNP. It was observed 
that when the r² threshold increased, the number of selected tagSNP 
and IA also increased. The accuracy was between 0.960 
(r² = 0.1) and 0.982 (r² = 0.5), with average distance 
between tagSNP of 1.86 Mb and 0.16 Mb, respectively. Xu 
et al. [28] used r² = 0.8 to select a set of tagSNP for 
genome-wide association analyses in humans. Their use 



Table 2 Imputation accuracy of SNP on chromosome 12 measured by IA or by R² 

Scenario   Genotype design                       Accuracy of imputation
           Grandparents   Parents   Progeny      IA      R²
1          HD             HD        LowD         0.962   0.884
2          HD             LowD      LowD         0.833   0.408

Comparison of two measures of accuracy of imputation calculated using experimental data (tagSNP panel = 30 SNP), under two situations: 1) F0 and F1 at high 
density (HD), F2 at low density (LowD), 2) F0 at HD, F1 and F2 at LowD. 
Figure 6 Imputation accuracy (R²) for SNP on chromosome 12 as a function of the minor allele frequency in the F0. Accuracy of 
imputation of experimental data as a function of minor allele frequency of each SNP (blue dots). Local regression fit (red dots). 

Figure 7 Imputation accuracy (R²) for SNP on chromosome 12 as a function of the distance to the closest tagSNP. Blue dots are 
non-tagSNP (Experimental data); distance in base pairs (Log 10). Local regression fit is displayed by the red dots. 

Figure 8 Imputation accuracy (R²) for SNP on chromosome 12 as a function of the difference in allelic frequencies. Accuracy of 
imputation using experimental data for the difference in allelic frequencies (blue circles) between founding breeds (Pietrain and Duroc). Local 
regression fit (red dots). 



was slightly different from ours in that they were selecting 
SNP to tag causative variants for genome-wide association 
using population level LD information only. In contrast, 
we wanted to use this method to select SNP that were 
more evenly spaced in terms of genetic distance, as done 
previously with outbred pig populations [10], but this time 
exploiting within and between family LD. Consequently, 
low levels of r² were used in the current study, as we found 
that with a threshold of r² > 0.6, many tagSNP were 
selected with marginal increases of IA. The second method 
employed to select tagSNP consisted of dividing the 
chromosome into segments of equal size, and then choos- 
ing the SNP that lay closest to the center of the segment. 
Other studies have used evenly spaced tagSNP by selecting 
one SNP every given number of markers [12], or by choos- 
ing in each segment the SNP with the largest MAF 
[4,11,12]. The fact that we had available a sizable number 
of SNP throughout the genome, i.e. 60 K, made it possible 
to select approximately evenly spaced SNP with a wide 
range of MAF, as long as those SNP were segregating in 
the population. The values of IA calculated while using 
tagSNP chosen at evenly spaced segments were similar to 
those obtained using the LD method. This similarity of re- 
sults may be due to an assumption made in the method of 
SNP selection at evenly spaced intervals, i.e. that the distri- 
bution of LD along the genome is almost uniform and 
there are no large blocks of LD. In the current research, 
the haplotypes of F1 animals are sampled from two 
populations: Duroc and Pietrain. The resulting LD was 
relatively high and uniformly distributed, except for a few 
blocks with extremely high LD, i.e. blocks with at least 7 
consecutive SNP with r² > 0.8. For this reason, evenly spaced 
tagSNP and tagSNP selected based on the LD method 
produced similar imputation accuracy at equivalent dens- 
ity. Although we indeed simulated assuming uniform re- 
combination rates, these results seem to agree also with 
experimental data, where the two methods of selection 
used here produced virtually the same accuracy in an out- 
bred pig population [10]. Designing custom low density 
SNP panels for each population of interest would not be 
cost effective. Consequently, we investigated the imput- 
ation accuracy obtained using a commercially available 
SNP chip with markers selected based on physical position 
and MAF [10]. 
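
The evenly spaced selection described above (dividing the chromosome into segments of equal size and choosing the SNP closest to the center of each segment) can be sketched as follows. This is a minimal illustration under stated assumptions: positions are physical coordinates in base pairs, segments span the range of observed positions, and the function name `select_tagsnp_even` and the toy positions are hypothetical.

```python
import numpy as np

def select_tagsnp_even(positions_bp, n_segments):
    """Pick, in each equal-size segment, the SNP closest to the segment center."""
    positions = np.asarray(positions_bp, dtype=float)
    # Segment boundaries spanning the observed positions (an assumption).
    edges = np.linspace(positions.min(), positions.max(), n_segments + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    tags = []
    for lo, hi, center in zip(edges[:-1], edges[1:], centers):
        in_segment = np.where((positions >= lo) & (positions <= hi))[0]
        if in_segment.size == 0:
            continue  # no SNP available in this segment
        nearest = in_segment[np.argmin(np.abs(positions[in_segment] - center))]
        tags.append(int(nearest))
    return sorted(set(tags))

# Toy chromosome with nine SNP (positions in arbitrary units) and two segments.
toy_positions = [0, 12, 24, 37, 50, 63, 76, 88, 100]
even_tags = select_tagsnp_even(toy_positions, 2)
print(even_tags)
```

Segments without any segregating SNP are skipped here, so the number of tagSNP returned can be smaller than the number of segments requested.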

Imputation using 9K panel and genotyping scenarios 

Data from a 9K chip (average distance between SNP = 
0.30 Mb) were used as a LowD panel to impute to a HD 
60K panel. Using the experimental data from F2 individ- 
uals, different genotyping scenarios were tested. In the 
first scenario, data consisted of F0 and F1 genotypes at HD 
and F2 at LowD, and average IA was 0.99. Similarly, 
Weigel et al. [13] imputed 8K genotypes to 43K using 
information of the sire, dam, and grandsires (paternal and 
maternal), and obtained a value of IA > 0.95. 
Our second scenario included the F1, the generation between 
grandparents and grand-offspring, genotyped at 9K, 
and it was observed that IA of F2 decreased to 
0.9. In our last scenario F0 and F1 were genotyped at HD 
and F2 at LowD but the relationships between the F2 and 
the reference panel were ignored, resulting in an average 
accuracy of imputation of 0.9. Badke et al. [10] used the genotypes 
of a reference population formed by trios to impute 
genotypes of an unrelated population, and obtained 
values of IA of 0.90 and 0.95 using reference groups of 16 
and 64 animals, respectively. 

Habier et al. [6] indicated that the reasons for the decay 
in accuracy of imputation are two-fold: 1) the accuracy of 
haplotyping the tagSNP flanking the non-tagSNP; 2) the ac- 
curacy of haplotyping the imputed non-tagSNP, conditional 
on a correct haplotyping of the tag-SNP. Therefore, the im- 
pact of both factors under the first two scenarios and taking 
into account the relationships between the individuals in 
the F2 and in the reference population, were evaluated by 
means of simulated data. Accuracies of haplotyping were 
calculated from the number of erroneous inferences of phase 
between consecutive heterozygous markers, as in Druet 
and Georges [31]. In all scenarios, it was observed that the 
phases of tagSNP were correct, thus the uncertainty was 
due to the grandparental origin of the non-tagSNP that 
were flanked by the tagSNP. The next step was to quantify 
the fraction of non-tagSNP with uncertain phase. When F0 
and F1 were genotyped at HD and F2 was genotyped at 
LowD, the fraction of non-tagSNP with uncertain phase 
was 4%, whereas this statistic was 30% when the F0 was 
genotyped at HD, and the F1 and F2 were genotyped at LowD. 
The corresponding IA were 0.97 and 0.85, respectively. 
These results suggest that accuracies of imputation in the 
current study were affected by knowledge of the phase of 
non-tagSNP. Moreover, when the number of genotypes 
from related individuals (i.e., F0 at HD) increases, the accuracy 
of haplotyping goes up, and the accuracy of imputation 
also increases. These results apply to genotyping designs 
with a pedigree with a small number of founder individuals 
genotyped in HD and a large number of progeny genotyped 
in LowD. If the phase is known in the founders, it is easy to 
accurately follow transmission of chromosomal segments 
to the remainder of the population using linkage informa- 
tion. In practice, however the phase needs to be ascertained 
using LD information. Such information is very limited in 
cases such as our F0 because of reduced sample size. In that 
case, the researchers can follow two paths. First, as 
presented with large pedigrees, having extra animals from 
the same founding population(s) can help in using LD to 
accurately phase those animals. Second, as presented here, 
two consecutive generations can be genotyped in HD to 
use the information in grandparents (F0) to accurately 
phase the parents (F1) and then use linkage information to 
impute genotypes within the progeny (F2). For such 



approaches to work, full pedigree information (three gener- 
ations) and two generations of HD genotypes are needed. 
The approach is still cost effective in typical F2 populations 
[6,32]. These results are partially reaffirmed in large pedi- 
gree based imputation. 
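
The haplotyping error count described above (erroneous inferences of phase between consecutive heterozygous markers) can be sketched as a switch-error count. This is an illustrative sketch only: it assumes that the true haplotypes are known, as in simulated data, and the function name `switch_errors` and the toy haplotypes are hypothetical.

```python
def switch_errors(true_hap1, true_hap2, inferred_hap1):
    """Count phase switches between consecutive heterozygous markers.

    Returns (number of switches, number of consecutive heterozygous pairs).
    """
    # Only heterozygous markers carry phase information.
    het = [i for i, (a, b) in enumerate(zip(true_hap1, true_hap2)) if a != b]
    # At each heterozygous marker, does the inferred haplotype follow true_hap1?
    follows_hap1 = [inferred_hap1[i] == true_hap1[i] for i in het]
    # A switch occurs when the inferred haplotype jumps between true haplotypes.
    switches = sum(1 for prev, cur in zip(follows_hap1, follows_hap1[1:]) if prev != cur)
    pairs = max(len(het) - 1, 0)
    return switches, pairs

# Toy example: markers 0, 1, 3 and 5 are heterozygous; one switch after marker 1.
true_h1 = [0, 1, 0, 1, 1, 0]
true_h2 = [1, 0, 0, 0, 1, 1]
inferred = [0, 1, 0, 0, 1, 1]

n_switches, n_pairs = switch_errors(true_h1, true_h2, inferred)
haplotyping_accuracy = 1.0 - n_switches / n_pairs
```

Expressing accuracy as one minus the switch rate is one possible convention; the source only states that errors were counted between consecutive heterozygous markers.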

MAF effect 

The measures of accuracy of imputation that are based 
only on allelic counts are not useful for comparing SNP 
having different values of MAF. This is due to the fact that 
imputation errors are highly sensitive to the value of the 
allelic frequencies [8,12,30]. To overcome this restriction, 
two alternative measures of accuracy of imputation have 
been proposed: 1) the correlation between imputed and 
observed genotypes [8]; and 2) an accuracy of imputation 
corrected to its expected value [12,30]. The second 
method consists of adjusting the calculated accuracy of 
imputation by the difference between the observed accur- 
acy and an estimate of the expected value under random 
sampling. There are several possible ways of calculating 
the accuracy under this method. Regardless of the meas- 
ure being used to calculate the accuracy, a trend for the 
accuracy of imputation to drop when MAF < 0.15 has 
been observed. For example, in maize Hickey et al. [8] observed 
a decrease in the correlation when MAF < 0.10, and the drop 
was higher when the masked genotypes were >84% of 
total SNP. Similarly, Lin et al. [30] used human data with 
the correction for expected accuracy and observed a 
marked decrease in accuracy of imputation when MAF < 
0.15. Hayes et al. [12] used the same correction as Lin 
et al. [30] with sheep data and found highly variable accur- 
acies of imputation but tending to decrease whenever 
MAF < 0.10. The squared correlation between observed and 
imputed genotypes (R²) was employed in the current 
research to evaluate the effect of MAF on imputation 
accuracy. Our results showed that markers with MAF < 
0.10 in the founders were imputed with reasonably good 
accuracy in the F2 (Figure 6), a result different from those 
previously discussed. This is not unexpected considering 
we used both LD and linkage (pedigree information), as 
sources of information from our crossbred population. 
Therefore, the allele frequency in the F0 does not matter 
as long as the two alleles are segregating in that generation. 
Moreover, whenever the F1 is genotyped at HD, SNP 
with low MAF can be observed in the F0 and F1. Coupled 
with the fact that all family relationships are known, this 
simplifies the imputation of F2 animals. 

Possible effects in association 

In the current research we compared allelic dosage of ob- 
served and imputed genotypes to find accurate genotyping 
design and imputation methods for LowD genotypes in an 
F2 population. Zhen et al. [33] reported that the regression 
of phenotype on allelic dosage was an accurate method to 
evaluate QTL effects. Moreover, they observed that when 
accuracies of imputation were high, the power for the as- 
sociation test was high. For example, accuracies of imput- 
ation > 0.95 were associated with values of power > 0.85. 
In the current study, the accuracy of imputation obtained 
with the 9K panel was R² = 0.94, which suggests that the 
power for an association test is high. Other studies also 
found that imputation improved the power for association 
tests. Using data from humans, Hao et al. [34] compared 
the power for GWAS analysis of four different strategies 
involving imputation: (1) directly testing for associations 
using the Illumina 317K SNPs, (2) testing for associations 
using the entire imputed HapMap SNP set based on the 
Illumina 317K genotype data; (3) directly testing for asso- 
ciations using the Illumina 650Y SNPs; and (4) testing for 
associations using the entire imputed HapMap SNP set 
based on Illumina 650Y genotype data. It was observed 
that genome-wide imputation (strategies 2 and 4) improved 
power by 5.5% for the Illumina 317K, or 3.3% for 
Illumina 650Y, compared to the analyses with assayed 
SNPs only (strategies 1 and 3, respectively). Similar results 
were obtained by Anderson et al. [5] for the 300K and 
550K platforms. 

The cost of genotyping is an important consideration. 
At present, the cost of commercial HD genotyping (60K) 
for pigs is more than twice the cost of genotyping 
with the 9K chip. Assuming a population with a 
structure similar to the one used here (approximately 20 
F0, 56 F1 and 1000 F2), one can genotype 1.9 times as many 
individuals in a scenario with F0 and F1 at HD, and F2 at 
LowD than in a scenario with F0, F1 and F2 at HD. The 
imputed genotypes can then be used for association or for 
meta-analysis studies. 
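
The cost comparison can be reproduced approximately with a short calculation. The sketch below assumes relative prices in which HD genotyping costs exactly twice as much as 9K genotyping (the text only states "more than twice"), and uses the approximate population structure given above. The function name and the price units are illustrative assumptions.

```python
def genotyping_cost(n_f0, n_f1, n_f2, cost_hd, cost_low, f2_at_hd):
    """Total cost when F0 and F1 are at HD and F2 is at HD or low density."""
    founders_and_parents = (n_f0 + n_f1) * cost_hd
    progeny = n_f2 * (cost_hd if f2_at_hd else cost_low)
    return founders_and_parents + progeny

# Assumed relative prices: 9K chip = 1 unit, 60K chip = 2 units.
COST_LOW, COST_HD = 1.0, 2.0

all_hd = genotyping_cost(20, 56, 1000, COST_HD, COST_LOW, f2_at_hd=True)
mixed = genotyping_cost(20, 56, 1000, COST_HD, COST_LOW, f2_at_hd=False)
ratio = all_hd / mixed
print(f"all HD: {all_hd:.0f}, mixed: {mixed:.0f}, ratio: {ratio:.2f}")
```

Under these assumed prices the ratio is about 1.87, in line with the factor of 1.9 quoted above; a larger price gap between the two chips would increase the ratio.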

Conclusions 

Designing custom SNP panels for each F2 population to 
be imputed will likely not be cost effective due to the rela- 
tively large number of SNP needed to attain reasonable 
imputation accuracies, and the high development costs for 
each SNP panel. In particular, for our population we 
would need a minimum of M = 1,200 markers with average 
distance of 2.1 Mb to have IA over 0.97 in the F2. On 
the other hand, using the 9K panel as tagSNP (LowD) 
resulted in IA of 0.99 when the F0 and F1 were genotyped 
at HD and the F2 at LowD. The cost of such a genotyping 
scheme would be less than half the cost of using HD 
genotypes for all individuals. The correlation between 
observed and imputed genotypes was high (R² = 0.94), so 
that the power for future association studies would be 
high. Thus, under a genotyping strategy of high accuracy 
of imputation (i.e., F0 and F1 at HD, F2 at LowD), information 
from imputed genotypes that is similar to that from a HD panel 
can be obtained for more animals at a lower 
cost. These results apply to the imputation of markers in 
the SNP60 beadchip, in populations where a small number 
of founders can be genotyped at HD and phase of parents 
of imputed animals can be derived with certainty. 
Translation of LD-based results, on the other hand, is 
constrained to pig populations showing similar levels of 
LD as in the founding animals [35]. 

Methods 

Animals 

The experimental population was raised at the Michigan 
State University Swine Teaching and Research Farm, East 
Lansing, MI [1]. Parents from the initial generation (F0)
were four unrelated Duroc boars mated to 15 Pietrain
sows by artificial insemination. From all resulting F1
animals, 50 females and 6 males (progeny of 3 F0 sires) were
selected as parents for the F2 generation, by avoiding full 
or half sib matings. A total of 1,259 F2 piglets were born 
alive from 142 litters out of 11 farrowing groups. Animal 
protocols were approved by the Michigan State University 
All University Committee on Animal Use and Care (AUF# 
09/03-114-00). 

Genotyping and data editing 

DNA was isolated from white blood cells using standard 
procedures as we have previously described for this popu- 
lation [1]. Quantity and quality of DNA samples were de- 
termined using a Qubit fluorometer (Invitrogen by Life 
Technologies, Carlsbad, CA, USA). The number of genotyped
animals was N = 411 (4 F0 Duroc boars, 15 F0
Pietrain sows, 6 F1 males, 50 F1 females and 336 F2 pigs).
Genotyping was performed at a commercial laboratory 
(GeneSeek, a Neogen Company, Lincoln, NE, USA) 
using the Illumina PorcineSNP60 beadchip [36]. Out 
of M = 62,163 SNP, 6,422 SNP were eliminated as their
physical positions were unknown. Mendelian inconsistencies
(< 0.01%) were taken as missing genotypes, and
12 animals (1 F1 and 11 F2) with more than 10% of SNP
missing were not used in any analysis. By similar consider- 
ation, 3,038 SNP were removed from the analyses due to 
presenting more than 10% missing data. Additionally, 
10,139 SNP were excluded as their minor allele frequency 
(MAF) was below 0.01. These editing policies resulted in a 
data set comprising 399 pigs with 45,003 SNP per animal. 
This editing procedure followed that of Badke et al. [35] 
and the program PLINK v1.07 [37] was used. Additionally,
starting with genotypes for F0 and F1 animals, genotypes
for 932 F2 animals were simulated conditional on the real 
pedigree using a gene-dropping model. Simulated geno- 
types were used to assess alternative tagSNP selection 
procedures while experimental genotypes on a subset of 
animals (n = 336) were used to assess imputation accuracy 
using a SNP list for a 9K commercial chip that has re- 
cently been publicly released by GeneSeek Inc. (Lincoln, 
NE, USA; described in Badke et al. [10]).

Genotype simulation

A stochastic simulation was performed to evaluate the effect
of two different methods of selecting tagSNP on the accuracy
of the resulting imputed F2 genotypes. The genotypes of
932 F2 animals were simulated using gene-dropping [38]
theory, by conditioning on the real pedigree and on the
haplotypes of the 55 F1 parents (6 males and 49 females) from
the real F2 population. The haplotypes were estimated with
high accuracy from the genotypes of the F1 parents and 19
F0 ancestors (4 Duroc boars and 15 Pietrain sows), using
the software MERLIN [39]. The number of recombinations
in the F1 haplotypes was drawn from a Poisson distribution
with mean equal to the length of the given chromosome
in Morgans (M), assuming 1 Mb = 1 cM [40]. The
positions of the recombinations were simulated from a
uniform distribution using Haldane's mapping function
[41,42]. As an example, there were 1,405 SNP on chromosome
12 spread over 64.2 Mb, so the average distance between
markers was 0.04573 Mb. By assuming a recombination
rate of 1 cM per Mb [38], the number of recombinations
in chromosome 12 was drawn from a Poisson distribution
with parameter equal to 64.2 / 100 = 0.642. The next step
was to assign the resulting recombinant gametes of the F1
parents to their F2 progeny.
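
A minimal sketch of the gene-dropping step for one gamete is given below. It is not the simulation code used in this study: the function name, the coding of haplotypes as 0/1 arrays and the random choice of the starting strand are assumptions made for illustration. The Poisson number of crossovers, their uniform positions and the 1 Mb = 1 cM conversion follow the description above.

```python
import numpy as np


def simulate_gamete(hap_a, hap_b, positions_mb, chrom_length_mb, rng):
    """Drop one gamete from a parent carrying haplotypes hap_a and hap_b."""
    # Expected number of crossovers = chromosome length in Morgans,
    # assuming 1 Mb = 1 cM (so 100 Mb = 1 M).
    n_cross = rng.poisson(chrom_length_mb / 100.0)
    # Crossover positions are uniform along the chromosome (no interference).
    breakpoints = np.sort(rng.uniform(0.0, chrom_length_mb, size=n_cross))
    # Number of crossovers located before each SNP; its parity decides
    # which parental strand is transmitted at that SNP.
    switches = np.searchsorted(breakpoints, positions_mb)
    start = rng.integers(0, 2)  # strand transmitted at the chromosome start
    from_b = (start + switches) % 2 == 1
    return np.where(from_b, hap_b, hap_a)


# Chromosome 12 example from the text: 1,405 SNP spread over 64.2 Mb.
# Evenly spaced positions and all-0 / all-1 haplotypes are placeholders.
rng = np.random.default_rng(1)
positions = np.linspace(0.0, 64.2, 1405)
hap_a = np.zeros(1405, dtype=int)
hap_b = np.ones(1405, dtype=int)
gamete = simulate_gamete(hap_a, hap_b, positions, 64.2, rng)
```

An F2 genotype is then obtained by adding one gamete simulated from the F1 sire and one from the F1 dam.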

TagSNP selection using simulated dataset 

Two different methods were used for tagSNP selection: 1)
The first one consisted of a statistical search built into the
software FESTA [43] and used information on LD [44]. In
this method, each SNP was either an element of the
tagSNP set, or in LD with an existing element of the
tagSNP set, at a value equal to or larger than a specified
threshold (r²) [10]. A minimum level of r² based on pairwise
LD of the F1 haplotypes was selected, so that all SNPs
above the chosen threshold were selected as tagSNP. 2)
The second method consisted of selecting evenly spaced
markers. The chromosome was divided into k segments of
equal length, and then the SNP closest to the center
of each segment was selected. When no SNP lay in a
segment, no selection was performed for that segment,
resulting in a number of tagSNP < k in segments of
approximately equal length.
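
The evenly spaced selection (method 2) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name is hypothetical, and SNP positions are assumed to be given in Mb from the start of the chromosome.

```python
import numpy as np


def select_evenly_spaced(positions_mb, chrom_length_mb, k):
    """Return indices of the SNP closest to the center of each of k segments.

    Segments that contain no SNP contribute no tagSNP, so the result can
    hold fewer than k indices.
    """
    positions_mb = np.asarray(positions_mb, dtype=float)
    seg_len = chrom_length_mb / k
    # Segment index of each SNP; a SNP at the very end goes to the last one.
    seg_idx = np.minimum((positions_mb // seg_len).astype(int), k - 1)
    selected = []
    for s in range(k):
        members = np.flatnonzero(seg_idx == s)
        if members.size == 0:
            continue  # empty segment: no selection is performed
        center = (s + 0.5) * seg_len
        best = members[np.argmin(np.abs(positions_mb[members] - center))]
        selected.append(int(best))
    return selected


# Toy example: four SNP on a 10 Mb chromosome split into k = 5 segments.
tags = select_evenly_spaced([0.5, 1.0, 4.9, 9.0], 10.0, 5)
print(tags)
```

In the toy example two of the five segments are empty, so only three tagSNP are returned.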

Genotype imputation 

For simulated data, F2 genotypes of non-typed markers 
were imputed using the algorithm of Lander and Green 
[39] that predicts the non-tagSNP by conditioning on the 
observed markers. For computational reasons the pedigree 
was analyzed on a per litter basis. Thus, for each F2 litter, 
a three-generation pedigree was built [45] using the four
F0 grandparents, the two F1 parents, and up to a maximum
of 10 F2 animals. When the litter had more than 10
progeny, a new "family" was formed with the four F0
grandparents, the two F1 parents and the remaining F2
animals. The resulting "families" were analyzed separately
and genotypes were imputed with MERLIN [39]. Breaking
the pedigree in this way produces some loss of informa- 
tion, but simulation results (data not shown) suggested 
that the loss was negligible. 
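
The per-litter splitting can be sketched as below. The function name, the dictionary layout and the animal identifiers are hypothetical; only the rule of at most 10 F2 animals per "family", each sharing the same four grandparents and two parents, comes from the description above.

```python
def split_litter(f0_grandparents, f1_parents, f2_progeny, max_progeny=10):
    """Split one litter into three-generation "families" of limited size."""
    families = []
    for start in range(0, len(f2_progeny), max_progeny):
        chunk = f2_progeny[start:start + max_progeny]
        families.append({
            "F0": list(f0_grandparents),  # the four grandparents
            "F1": list(f1_parents),       # the two parents
            "F2": list(chunk),            # up to max_progeny offspring
        })
    return families


# Hypothetical litter of 13 piglets: one family of 10 and one of 3.
litter = [f"pig{i}" for i in range(13)]
families = split_litter(["gs1", "gd1", "gs2", "gd2"], ["sire", "dam"], litter)
print([len(fam["F2"]) for fam in families])
```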

For experimental data, F2 genotypes of non-typed 
markers were imputed using the algorithm built into the 
software AlphaImpute [4]. The algorithm implemented
in AlphaImpute [4] uses information on population-wide
and within-family LD, and it required some tuning. In
particular, we set the core length parameter to 100, 150, 
400 and 600 SNP and the tail parameter haplotype to 300, 
400, 600 and 800 SNP, respectively. Likewise, genotype 
error percentage parameter was set to 0%, so as to obtain 
a high percentage of alleles under the correct phase [46]. 
The algorithm was run for the entire pedigree as there 
was no computing restriction in this case. 

Calculation of the accuracy of imputation 

Irrespective of data generation (simulation or experimen- 
tal), the accuracy of genotype imputation in F2 individuals 
for all methods was evaluated using two different statistics. 
First, the mean of the difference between observed and
imputed allelic dosage was calculated [9,13] as follows:

$$IA = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2 M_i} \sum_{j=1}^{M_i} \left| g_{ij} - \hat{g}_{ij} \right|$$

In this expression, $N$ is the total number of animals imputed,
$M_i$ represents the number of markers with
observed genotype in animal $i$, $g_{ij}$ is the observed
(experimental or simulated) allelic dosage in animal $i$ and SNP $j$,
and $\hat{g}_{ij}$ is the corresponding imputed allelic dosage. Allelic
dosage was defined as the number of copies of a reference
allele that took values 0, 1 and 2 for homozygous reference,
heterozygous and homozygous non-reference, respectively.
The second expression used to quantify the imputation accuracy
was the square of the correlation between observed
and imputed genotypes at each allele, or $R^2$ statistic of
Huang et al. [47]. Denoting by $\bar{\hat{g}}$ the average value of the
imputed genotypes, and by $\bar{g}$ the average value of observed
genotypes, the statistic was calculated as follows:

$$R^2 = \frac{\left[ \sum_{i=1}^{N} (g_i - \bar{g})(\hat{g}_i - \bar{\hat{g}}) \right]^2}{\sum_{i=1}^{N} (g_i - \bar{g})^2 \, \sum_{i=1}^{N} (\hat{g}_i - \bar{\hat{g}})^2}$$

where $g_i$ and $\hat{g}_i$ are the observed and imputed genotypes of
animal $i$. The statistic is interpreted as a squared correlation
coefficient.
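
Both accuracy statistics can be sketched in a few lines. This is an illustration under assumptions, not the code used in the study. In particular, IA is taken here to be one minus the mean absolute difference in allelic dosage, divided by 2 so that it lies between 0 and 1, and averaged first within and then across animals. Function names and the use of NaN for unobserved genotypes are hypothetical.

```python
import numpy as np


def imputation_accuracy(observed, imputed):
    """IA over an animals-by-SNP dosage matrix (assumed form, see lead-in).

    Entries of `observed` that are NaN are treated as not observed and are
    skipped. Every animal is assumed to have at least one observed marker.
    """
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = ~np.isnan(observed)
    diff = np.where(mask, np.abs(observed - imputed), 0.0)
    m_i = mask.sum(axis=1)  # markers with observed genotype per animal
    per_animal = diff.sum(axis=1) / (2.0 * m_i)
    return 1.0 - per_animal.mean()


def r_squared(observed_snp, imputed_snp):
    """Squared correlation between observed and imputed genotypes at one SNP."""
    g = np.asarray(observed_snp, dtype=float)
    g_hat = np.asarray(imputed_snp, dtype=float)
    dg = g - g.mean()
    dh = g_hat - g_hat.mean()
    return (dg @ dh) ** 2 / ((dg @ dg) * (dh @ dh))


# Toy data: two animals, three SNP, one dosage imputed wrongly by one allele.
obs = [[0, 1, 2], [2, 1, 0]]
imp = [[0, 1, 1], [2, 1, 0]]
ia = imputation_accuracy(obs, imp)
r2 = r_squared([0, 1, 2, 1], [0, 1, 2, 1])
```

In the toy data one allele out of six is wrong in the first animal and none in the second, so IA is 1 - (1/6 + 0)/2 = 11/12.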
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